The take-home portion of Exam 1 will be submitted via Canvas as an R Markdown file. Any question requiring a hypothesis test should include the following structure:


Problem 1

A local administrator wants to compare the starting salaries for teachers in Kentucky and Ohio. A sample of 40 KY teachers is taken and each is asked to report their current salary. Similarly, a sample of 55 OH teachers is taken and asked the same question. The data obtained is posted in a .csv file on Canvas.

\(\underline{\textbf{Parameters and Hypothesis Test}}\)

\(Q_{1(OH)} = \text{first quartile of Ohio teachers' salaries}\)

\(Q_{1(KY)} = \text{first quartile of Kentucky teachers' salaries}\)

\(H_0: Q_{1(OH)} = Q_{1(KY)}\)

\(H_A: Q_{1(OH)} > Q_{1(KY)}\)

\(\alpha = 0.10\)

\(\underline{\textbf{Test Statistic}}\)

\(T(X) = Q_{1(OH)} - Q_{1(KY)}\)

library(tidyverse)
salaries = read_csv("salary.csv")

# Create vectors for OH and KY salaries
salaries.OH = filter(salaries, State == "OH") %>% select(Salary)
salaries.KY = filter(salaries, State == "KY") %>% select(Salary)

# Calculate the first quartile for OH and KY
q1.OH = quantile(salaries.OH$Salary, probs = 0.25)
q1.KY = quantile(salaries.KY$Salary, probs = 0.25)

q1.obs.ts = q1.OH - q1.KY

The observed test statistic \(T(X) = Q_{1(OH)} - Q_{1(KY)}\) = 536.5

\(\underline{\textbf{The Null Distribution of T(X) and the observed test statistic}}\)

# Create vector for storing simulated test statistic
q1.simulated.ts = numeric(9999)

for (i in 1:9999){
  # Sample 55 of the 95 pooled indices without replacement to represent Ohio;
  # the remaining 40 represent Kentucky. Then calculate the simulated test statistic
  index = sample(95, 55, replace = F)
  q1.simulated.ts[i] = quantile(salaries$Salary[index], probs = 0.25) - quantile(salaries$Salary[-index], probs = 0.25) 
}

hist(q1.simulated.ts, main = "Null Distribution of T(X)", xlab = "Simulated Test Statistic: Q1(OH) - Q1(KY)")
abline(v = q1.obs.ts, col="blue", lwd = 3)

q1.pvalue = (sum(q1.simulated.ts >= q1.obs.ts) + 1) / 10000

Conclusion: The p-value for the hypothesis test is 0.0425, which suggests the data collected would be unlikely to occur if the null hypothesis were true. At the 0.10 level of significance, there is sufficient evidence to reject the null hypothesis and conclude that the first quartile of Ohio teachers' salaries is greater than the first quartile of Kentucky teachers' salaries.

Doubling the sample size for the two states would not affect the center of the null distribution; it would still be centered at 0. However, the variability, or spread, of the distribution would decrease as the sample size increases.
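This claim can be checked with a small simulation sketch. The salary values here are hypothetical (drawn from a normal distribution rather than the actual data), since the point is only how the spread of the permutation null distribution behaves as the sample sizes grow:

```r
# Sketch, assuming hypothetical salary data generated under the null
# (one common population): the spread of the permutation null distribution
# of the Q1 difference shrinks as the sample sizes grow.
set.seed(1)
null_sd = function(n.oh, n.ky){
  salary = rnorm(n.oh + n.ky, mean = 38000, sd = 2000)
  sims = replicate(2000, {
    index = sample(n.oh + n.ky, n.oh, replace = F)
    quantile(salary[index], probs = 0.25) - quantile(salary[-index], probs = 0.25)
  })
  sd(sims)
}
null_sd(55, 40)    # original sample sizes
null_sd(110, 80)   # doubled sample sizes -- noticeably smaller spread
```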

# Overlay kernel density estimates for the two states: draw the first plot,
# then add the second on the same axes with par(new = T)
plot(density(salaries.OH$Salary, kernel = "gaussian", bw = sd(salaries.OH$Salary)/sqrt(length(salaries.OH$Salary))), main = "Comparison of Teacher Salaries in Ohio and Kentucky", sub = "Kernel Density Estimate", xlim=c(35000, 45000), ylim=c(0, 0.0005), col="red", lwd=2)

par(new=T)
plot(density(salaries.KY$Salary, kernel = "gaussian", bw=sd(salaries.KY$Salary)/sqrt(length(salaries.KY$Salary))), main = "", sub = "", xlim=c(35000, 45000), ylim=c(0, 0.0005), col="blue", lwd=2)

A test statistic that could be used to test for equal variability is the difference of the standard deviations for the two populations: \(T(X) = SD_{OH} - SD_{KY}\). The most direct approach would be to compare the sample variances, but since the standard deviation is expressed in the same units as the salaries themselves, taking the difference of the standard deviations is an intuitive choice and provides a consistent way to judge whether the variability is the same for each population. If the variances are approximately equal, the null distribution of this statistic should be centered around 0; large values, either positive or negative, would provide evidence that the variances of the populations are not equal.

\(\underline{\textbf{Parameters and Hypothesis Test}}\)

\(SD_{OH} = \text{standard deviation of Ohio teachers' salaries}\)

\(SD_{KY} = \text{standard deviation of Kentucky teachers' salaries}\)

\(H_0: SD_{OH} = SD_{KY}\)

\(H_A: SD_{OH} \ne SD_{KY}\)

\(\alpha = 0.10\)

\(\underline{\textbf{Test Statistic}}\)

\(T(X) = SD_{OH} - SD_{KY}\)

# Observed test statistic
sd.diff.obs.ts = sd(salaries.OH$Salary) - sd(salaries.KY$Salary)

The observed test statistic \(T(X) = SD_{OH} - SD_{KY}\) = 37.03

\(\underline{\textbf{The Null Distribution of T(X) and the observed test statistic}}\)

# Vector for simulated test statistic
sd.diff.sim.ts = numeric(9999)

for (i in 1:9999){
  index = sample(95, 55, replace = F)
  sd.diff.sim.ts[i] = sd(salaries$Salary[index]) - sd(salaries$Salary[-index])
}

hist(sd.diff.sim.ts, main = "Null Distribution of T(X)", xlab = "Simulated Test Statistic SD(OH) - SD(KY)")
abline(v = sd.diff.obs.ts, col = "green", lwd = 3)

sd.diff.pvalue = (sum(sd.diff.sim.ts >= sd.diff.obs.ts) + 1) / 10000

Conclusion: The p-value for the hypothesis test is 0.4437, which suggests the data collected would not be unusual if the null hypothesis were true. At the 0.10 level of significance, there is insufficient evidence to reject the null hypothesis, so the variability of Ohio teachers' salaries and Kentucky teachers' salaries can be treated as approximately equal.

Problem 2

In class we considered the data collected from the Mythbusters’ experiment investigating contagious yawning. When investigating this myth, the hosts concluded their data confirmed the myth, a conclusion we found was not statistically justified.

Suppose that they plan to repeat the experiment but would like input on the sample size which should be used. Using the results from their initial experiment as a starting point, they would like to plan a similar experiment but with a sample size large enough to detect a difference of 4.41% between the two groups in yawning percentages (the difference found in the original experiment).
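As a rough analytic point of comparison for the simulation study, R's built-in `power.prop.test` can estimate the required sample size directly. Note this is a sketch under different assumptions than the simulation: it uses a normal approximation and assumes equal group sizes.

```r
# Rough analytic check, assuming equal group sizes and the normal
# approximation used by stats::power.prop.test: subjects needed per group
# to detect 29.412% vs 25% yawning with 80% power, one-sided alpha = 0.05
n.per.group = power.prop.test(p1 = 0.25, p2 = 0.29412, power = 0.80,
                              sig.level = 0.05,
                              alternative = "one.sided")$n
n.per.group
```

This lands in the low thousands per group, which is consistent with the simulated power curves crossing 0.8 only at large total sample sizes.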

# Create function to call for calculating power
mb_power = function(size){

  pvalue <- numeric(500)
  # The original experiment assigned roughly 2 subjects to the seeded group for every
  # 1 in the control: 34/50 = .68 seeded and 16/50 = .32 control. Use the same
  # allocation here, rounding so the two group sizes always sum to `size`
  seed_size = round(size*.68)
  control_size = size - seed_size
  
  for(i in 1:500){
    yawn_s <- rbinom(seed_size, 1, 0.29412)
    yawn_c <- rbinom(control_size, 1, 0.25)
    obs_ts <- sum(yawn_s)/seed_size - sum(yawn_c)/control_size
    
    permutation <- replicate(9999, sample(c(yawn_s, yawn_c), size))
    
    null_distn <- apply(permutation[1:seed_size,],2,sum)/seed_size - apply(permutation[(seed_size+1):size,],2,sum)/control_size
    
    pvalue[i] <- (sum(null_distn >= obs_ts)+1)/10000
  }
  
  sum(pvalue <= 0.05)/500
}

# Calculate power for each sample size
power.n50 = mb_power(50)
power.n250 = mb_power(250)
power.n500 = mb_power(500)
power.n750 = mb_power(750)
power.n1000 = mb_power(1000)
power.n1500 = mb_power(1500)
power.n3000 = mb_power(3000)
power.n5000 = mb_power(5000)

# Create data frame of sample size and the associated power
size = c(50, 250, 500, 750, 1000, 1500, 3000, 5000)
powr = c(power.n50, power.n250, power.n500, power.n750, power.n1000, power.n1500, power.n3000, power.n5000)

power.df = data.frame(size, powr)

power.plot = ggplot(power.df, aes(x = size, y = powr)) + geom_point() + geom_line() + ggtitle('MythBusters Example Power Curve') + xlab('Sample Size') + ylab('Power') + ylim(0, 1) + geom_hline(yintercept = 0.8, color = 'blue')

#library(plotly)
plotly::ggplotly(power.plot)

| Total Sample Size | # Assigned to Seeded Group | # Assigned to Control Group | Power |
|------------------:|---------------------------:|----------------------------:|------:|
| 50   | 34    | 16    | 0.038 |
| 250  | 170   | 80    | 0.158 |
| 500  | 340   | 160   | 0.228 |
| 750  | 510   | 240   | 0.324 |
| 1000 | 680   | 320   | 0.374 |
| 1500 | 1,020 | 480   | 0.488 |
| 3000 | 2,040 | 960   | 0.816 |
| 5000 | 3,400 | 1,600 | 0.964 |

# Calculate slope and intercept for the line connecting 3000 and 5000 samples
slope = (power.n3000 - power.n5000) / (3000 - 5000) 
intercept = power.n3000 - slope*3000

recommend = (0.8 - intercept) / slope

seed.num = round(recommend*0.68)
control.num = round(recommend*.32)

# Create function for calculating power with equal group sizes
mb_power_eqSize = function(size){

  pvalue <- numeric(500)
  # Unlike the original experiment's 68/32 allocation, assign half of the
  # subjects to each group
  seed_size = round(size*.5)
  control_size = size - seed_size
  
  for(i in 1:500){
    yawn_s <- rbinom(seed_size, 1, 0.29412)
    yawn_c <- rbinom(control_size, 1, 0.25)
    obs_ts <- sum(yawn_s)/seed_size - sum(yawn_c)/control_size
    
    permutation <- replicate(9999, sample(c(yawn_s, yawn_c), size))
    
    null_distn <- apply(permutation[1:seed_size,],2,sum)/seed_size - apply(permutation[(seed_size+1):size,],2,sum)/control_size
    
    pvalue[i] <- (sum(null_distn >= obs_ts)+1)/10000
  }
  
  sum(pvalue <= 0.05)/500
}

# Calculate power for each sample size
power2.n50 = mb_power_eqSize(50)
power2.n250 = mb_power_eqSize(250)
power2.n500 = mb_power_eqSize(500)
power2.n750 = mb_power_eqSize(750)
power2.n1000 = mb_power_eqSize(1000)
power2.n1500 = mb_power_eqSize(1500)
power2.n3000 = mb_power_eqSize(3000)
power2.n5000 = mb_power_eqSize(5000)

# Create data frame of sample size and the associated power
size2 = c(50, 250, 500, 750, 1000, 1500, 3000, 5000)
powr2 = c(power2.n50, power2.n250, power2.n500, power2.n750, power2.n1000, power2.n1500, power2.n3000, power2.n5000)

power2.df = data.frame(size2, powr2)

power2.plot = ggplot(power2.df, aes(x = size2, y = powr2)) + geom_point() + geom_line() + ggtitle('MythBusters Example Power Curve - Equal Sample Size') + xlab('Sample Size') + ylab('Power') + ylim(0, 1) + geom_hline(yintercept = 0.8, color = 'blue')


plotly::ggplotly(power2.plot)

| Total Sample Size | # Assigned to Seeded Group | # Assigned to Control Group | Power |
|------------------:|---------------------------:|----------------------------:|------:|
| 50   | 25   | 25   | 0.066 |
| 250  | 125  | 125  | 0.168 |
| 500  | 250  | 250  | 0.254 |
| 750  | 375  | 375  | 0.32  |
| 1000 | 500  | 500  | 0.426 |
| 1500 | 750  | 750  | 0.598 |
| 3000 | 1500 | 1500 | 0.824 |
| 5000 | 2500 | 2500 | 0.976 |

Conducting the experiment with equal sample sizes in each group would result in a more powerful experiment; the same power could be achieved with a smaller total sample size.
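The allocation effect can also be illustrated analytically. The helper below is my own sketch, not part of the assignment's simulation: it applies the standard normal-approximation power formula for the one-sided two-proportion z-test. At the same total sample size of 3000, the balanced design has a smaller standard error and therefore higher approximate power.

```r
# Sketch: normal-approximation power of the one-sided two-proportion z-test,
# used to compare a 68/32 split against a 50/50 split at the same total n.
# (approx_power is a hypothetical helper written for this illustration.)
approx_power = function(n1, n2, p1 = 0.29412, p2 = 0.25, alpha = 0.05){
  pbar = (n1*p1 + n2*p2) / (n1 + n2)
  se0  = sqrt(pbar*(1 - pbar)*(1/n1 + 1/n2))     # SE of the difference under H0
  se1  = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)   # SE under the alternative
  pnorm((p1 - p2 - qnorm(1 - alpha)*se0) / se1)
}
approx_power(2040, 960)    # 68/32 split of 3000 subjects
approx_power(1500, 1500)   # 50/50 split of 3000 subjects -- higher power
```

The two approximate values bracket 0.8 in the same way as the simulated powers at a total sample size of 3000, with the balanced split coming out ahead.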

Problem 3

Is there a difference in the price of groceries sold by Target and Wal-Mart? The data posted in the .csv file on Canvas contains a sample of grocery items and their prices advertised on their respective web sites on one specific day.

\(\underline{\textbf{Parameters and Hypothesis Test}}\)

\(Price_{T} = \text{price of a grocery item at Target}\)

\(Price_{WM} = \text{price of the same grocery item at Wal-Mart}\)

\(H_0: Price_T = Price_{WM}\)

\(H_A: Price_T \ne Price_{WM}\)

\(\alpha = 0.10\)

\(\underline{\textbf{Test Statistic}}\)

\(\bar{Y}_d = \frac{\sum (Price_T - Price_{WM})}{\text{number of grocery items compared}}\)

\(\underline{\textbf{Null Distribution}}\)

# Read in Groceries data
groceries = read_csv("Groceries.csv")

# Calculate the difference in the observed prices, then calculate the mean of the difference to find the value of the observed test statistic
groc.diff.obs = groceries$Target - groceries$Walmart
groc.obs.ts = mean(groc.diff.obs)

groc.sim.ts = numeric(9999)

for (i in 1:9999){
  # Randomly flip the sign of each observed paired difference
  sign = sample(c(-1, 1), length(groc.diff.obs), replace = T)
  groc.diff.sim = sign * groc.diff.obs
  groc.sim.ts[i] = mean(groc.diff.sim)
  
}

hist(groc.sim.ts, main = "Null Distribution of Y-bar", xlab = "Simulated Test Statistic Price(T) - Price(WM)")
abline(v=groc.obs.ts, col="purple", lwd=3)

groc.pvalue = (sum(groc.sim.ts >= groc.obs.ts) + 1) / 10000

Conclusion: The p-value for the hypothesis test is 0.3489, which suggests the data collected would not be unusual if the null hypothesis were true. At the 0.10 level of significance, there is insufficient evidence to reject the null hypothesis, so the prices of groceries at Target can be treated as approximately equal to the prices of groceries at Wal-Mart.